Introduction to TidyR
Gathering columns into key-value pairs
The most important function in tidyr is gather(). It should be used when you have columns that are not variables and you want to collapse them into key-value pairs.
The easiest way to visualize the effect of gather() is that it makes wide datasets long. As you saw, running the following command on wide_df will make it long:
gather(wide_df, my_key, my_val, -col)
# Read BMI data
library(readr)
bmi <- read_csv("../xDatasets/bmi_clean.csv")
# Apply gather() to bmi and save the result as bmi_long
library(tidyr)
bmi_long <- gather(bmi, year, bmi_val, -Country)
# View the first 20 rows of the result
bmi_long %>%
head(20) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", , font_size = 11) %>%
row_spec(0, bold = T, color = "white", background = "#3f7689")| Country | year | bmi_val |
|---|---|---|
| Afghanistan | Y1980 | 21.48678 |
| Albania | Y1980 | 25.22533 |
| Algeria | Y1980 | 22.25703 |
| Andorra | Y1980 | 25.66652 |
| Angola | Y1980 | 20.94876 |
| Antigua and Barbuda | Y1980 | 23.31424 |
| Argentina | Y1980 | 25.37913 |
| Armenia | Y1980 | 23.82469 |
| Australia | Y1980 | 24.92729 |
| Austria | Y1980 | 24.84097 |
| Azerbaijan | Y1980 | 24.49375 |
| Bahamas | Y1980 | 24.21064 |
| Bahrain | Y1980 | 23.97588 |
| Bangladesh | Y1980 | 20.51918 |
| Barbados | Y1980 | 24.36372 |
| Belarus | Y1980 | 24.90898 |
| Belgium | Y1980 | 25.09879 |
| Belize | Y1980 | 24.54345 |
| Benin | Y1980 | 20.80754 |
| Bermuda | Y1980 | 25.07881 |
Notice how now, instead of being represented in the column names, years are now all neatly represented in the year column. Try checking dim(bmi_long) and dim(bmi) before moving on.
Spreading key-value pairs into columns
The opposite of gather() is spread(), which takes key-values pairs and spreads them across multiple columns. This is useful when values in a column should actually be column names (i.e. variables). It can also make data more compact and easier to read.
The easiest way to visualize the effect of spread() is that it makes long datasets wide. As you saw, running the following command will make long_df wide:
spread(long_df, my_key, my_val)
# Apply spread() to bmi_long
bmi_wide <- spread(bmi_long, year, bmi_val)
# View the head of bmi_wide
bmi_wide %>%
head(8) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", , font_size = 11) %>%
row_spec(0, bold = T, color = "white", background = "#3f7689")| Country | Y1980 | Y1981 | Y1982 | Y1983 | Y1984 | Y1985 | Y1986 | Y1987 | Y1988 | Y1989 | Y1990 | Y1991 | Y1992 | Y1993 | Y1994 | Y1995 | Y1996 | Y1997 | Y1998 | Y1999 | Y2000 | Y2001 | Y2002 | Y2003 | Y2004 | Y2005 | Y2006 | Y2007 | Y2008 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Afghanistan | 21.48678 | 21.46552 | 21.45145 | 21.43822 | 21.42734 | 21.41222 | 21.40132 | 21.37679 | 21.34018 | 21.29845 | 21.24818 | 21.20269 | 21.14238 | 21.06376 | 20.97987 | 20.91132 | 20.85155 | 20.81307 | 20.78591 | 20.75469 | 20.69521 | 20.62643 | 20.59848 | 20.58706 | 20.57759 | 20.58084 | 20.58749 | 20.60246 | 20.62058 |
| Albania | 25.22533 | 25.23981 | 25.25636 | 25.27176 | 25.27901 | 25.28669 | 25.29451 | 25.30217 | 25.30450 | 25.31944 | 25.32357 | 25.28452 | 25.23077 | 25.21192 | 25.22115 | 25.25874 | 25.31097 | 25.33988 | 25.39116 | 25.46555 | 25.55835 | 25.66701 | 25.77167 | 25.87274 | 25.98136 | 26.08939 | 26.20867 | 26.32753 | 26.44657 |
| Algeria | 22.25703 | 22.34745 | 22.43647 | 22.52105 | 22.60633 | 22.69501 | 22.76979 | 22.84096 | 22.90644 | 22.97931 | 23.04600 | 23.11333 | 23.18776 | 23.25764 | 23.32273 | 23.39526 | 23.46811 | 23.54160 | 23.61592 | 23.69486 | 23.77659 | 23.86256 | 23.95294 | 24.05243 | 24.15957 | 24.27001 | 24.38270 | 24.48846 | 24.59620 |
| Andorra | 25.66652 | 25.70868 | 25.74681 | 25.78250 | 25.81874 | 25.85236 | 25.89089 | 25.93414 | 25.98477 | 26.04450 | 26.10936 | 26.17912 | 26.24017 | 26.30356 | 26.36793 | 26.43569 | 26.50769 | 26.58255 | 26.66337 | 26.75078 | 26.83179 | 26.92373 | 27.02525 | 27.12481 | 27.23107 | 27.32827 | 27.43588 | 27.53363 | 27.63048 |
| Angola | 20.94876 | 20.94371 | 20.93754 | 20.93187 | 20.93569 | 20.94857 | 20.96030 | 20.98025 | 21.01375 | 21.05269 | 21.09007 | 21.12136 | 21.14987 | 21.13938 | 21.14186 | 21.16022 | 21.19076 | 21.22621 | 21.27082 | 21.31954 | 21.37480 | 21.43664 | 21.51765 | 21.59924 | 21.69218 | 21.80564 | 21.93881 | 22.08962 | 22.25083 |
| Antigua and Barbuda | 23.31424 | 23.39054 | 23.45883 | 23.53735 | 23.63584 | 23.73109 | 23.83449 | 23.93649 | 24.05364 | 24.16347 | 24.26782 | 24.36568 | 24.45644 | 24.54096 | 24.60945 | 24.66461 | 24.72544 | 24.78714 | 24.84936 | 24.91721 | 24.99158 | 25.05857 | 25.13039 | 25.20713 | 25.29898 | 25.39965 | 25.51382 | 25.64247 | 25.76602 |
| Argentina | 25.37913 | 25.44951 | 25.50242 | 25.55644 | 25.61271 | 25.66593 | 25.72364 | 25.78529 | 25.84428 | 25.88510 | 25.92482 | 25.99177 | 26.07642 | 26.17288 | 26.27872 | 26.37522 | 26.47182 | 26.57778 | 26.68714 | 26.79005 | 26.88103 | 26.96067 | 26.99882 | 27.04738 | 27.11001 | 27.18941 | 27.28179 | 27.38889 | 27.50170 |
| Armenia | 23.82469 | 23.86401 | 23.91023 | 23.95649 | 24.00181 | 24.04083 | 24.08736 | 24.13334 | 24.17219 | 24.19556 | 24.20618 | 24.19790 | 24.12982 | 24.05854 | 24.02297 | 24.01570 | 24.02627 | 24.03885 | 24.07100 | 24.11699 | 24.18045 | 24.26670 | 24.37698 | 24.50332 | 24.64178 | 24.81447 | 24.99160 | 25.17590 | 25.35542 |
Separating columns
The separate() function allows you to separate one column into multiple columns. Unless you tell it otherwise, it will attempt to separate on any character that is not a letter or number. You can also specify a specific separator using the sep argument.
We’ve loaded the small dataset called treatments into your workspace. This dataset obeys the principles of tidy data, but we’d like to split the treatment dates into two separate columns: year and month. This can be accomplished with the following:
separate(treatments, year_mo, c("year", "month"))
# Apply separate() to bmi_cc
bmi_cc_clean <- separate(bmi_cc, col = Country_ISO, into = c("Country", "ISO"), sep = "/")
# Print the head of the result
head(bmi_cc_clean)Uniting columns
The opposite of separate() is unite(), which takes multiple columns and pastes them together. By default, the contents of the columns will be separated by underscores in te new column, but this behavior can be altered via the sep argument.
We’ve loaded the treatments data into your workspace again, but this time the year_mo column has been separated into year and month. The original column can be recreated by putting year and month back together:
unite(treatments, year_mo, year, month)
# Apply unite() to bmi_cc_clean
bmi_cc <- unite(bmi_cc_clean, Country_ISO, Country, ISO, sep = "-")
# View the head of the result
head(bmi_cc)Column headers are values, not variable names
You saw earlier in the chapter how we sometimes come across datasets where column names are actually values of a variable (e.g. months of the year). This is often the case when working with repeated measures data, where measurements are taken on subjects of interest on multiple occasions over time. The gather() function is helpful in these situations.
# Read Census data
census <- read_csv("../xDatasets/census-retail.csv")
# View the head of census
census %>%
head() %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", , font_size = 11) %>%
row_spec(0, bold = T, color = "white", background = "#3f7689")| YEAR | JAN | FEB | MAR | APR | MAY | JUN | JUL | AUG | SEP | OCT | NOV | DEC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1992 | 146913 | 147270 | 146831 | 148082 | 149015 | 149821 | 150809 | 151064 | 152595 | 153577 | 153605 | 155504 |
| 1993 | 157525 | 156292 | 154774 | 158996 | 160624 | 160171 | 162832 | 162491 | 163285 | 164711 | 166593 | 168101 |
| 1994 | 167504 | 169652 | 172775 | 173099 | 172340 | 174307 | 174801 | 177289 | 178776 | 180569 | 180695 | 181492 |
| 1995 | 182423 | 179472 | 180996 | 181702 | 183543 | 186088 | 185470 | 186814 | 187338 | 186546 | 189052 | 190809 |
| 1996 | 189167 | 192269 | 193993 | 194712 | 196210 | 196127 | 196229 | 196215 | 198843 | 200488 | 200200 | 201191 |
| 1997 | 202414 | 204273 | 204965 | 203372 | 201676 | 204666 | 207049 | 207643 | 208298 | 208064 | 208982 | 209379 |
# Gather the month columns
library(tidyr)
census2 <- gather(census, month, amount, -YEAR)
# Arrange rows by YEAR using dplyr's arrange
census2_arr <- arrange(census2, YEAR)
# View first 20 rows of census2
census2_arr%>%
head(12) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", , font_size = 11) %>%
row_spec(0, bold = T, color = "white", background = "#3f7689")| YEAR | month | amount |
|---|---|---|
| 1992 | JAN | 146913 |
| 1992 | FEB | 147270 |
| 1992 | MAR | 146831 |
| 1992 | APR | 148082 |
| 1992 | MAY | 149015 |
| 1992 | JUN | 149821 |
| 1992 | JUL | 150809 |
| 1992 | AUG | 151064 |
| 1992 | SEP | 152595 |
| 1992 | OCT | 153577 |
| 1992 | NOV | 153605 |
| 1992 | DEC | 155504 |
♀Variables are stored in both rows and columns
Sometimes you’ll run into situations where variables are stored in both rows and columns. To illustrate this, we’ve loaded the pets dataset from the video, which tells us in a convoluted way how many birds, cats, and dogs Jason, Lisa, and Terrence have. Print the pets dataset to see for yourself.
Although it may not be immediately obvious, if we treat the values in the type column as variables and create a separate column for each of them, we can set things straight. To do this, we use the spread() function. Run the following code to see for yourself:
spread(pets, type, num)
The result shows the exact same information in a much clearer way! Notice that the spread() function took in three arguments. The first argument takes the name of your messy dataset (pets), the second argument takes the name of the column to spread into new columns (type), and the third argument takes the column that contains the value with which to fill in the newly spread out columns (num).
Now let’s try this on a new messy dataset census_long. What information does this tell us?
# View first 50 rows of census_long
head(census_long, n = 50)
# Spread the type column
census_long2 <- spread(census_long, type, amount)
# View first 20 rows of census_long2
head(census_long2, n = 20)Multiple values are stored in one column
It’s also fairly common that you will find two variables stored in a single column of data. These variables may be joined by a separator like a dash, underscore, space, or forward slash.
The separate() function comes in handy in these situations. To practice using it, we have created a slight modification of last exercise’s result. Keep in mind that the into argument, which specifies the names of the 2 new columns being formed, must be given as a character vector (e.g. c(“column1”, “column2”)).
# View the head of census_long3
head(census_long3)
# Separate the yr_month column into two
census_long4 <- separate(census_long3, yr_month, c("year", "month"))
# View the first 6 rows of the result
head(census_long4, n = 6)